
ETL Principles

Best practices start to matter once the team grows and there are multiple data sources, calculation processes, and users. Following them curbs the urge to make ad-hoc changes that 'solve it quickly to get it going' but eventually tangle the whole pipeline.

Load Data Incrementally

  • Extract data incrementally at regular intervals
  • Airflow makes this easy by scheduling DAG runs on a regular cadence, so each run only processes the data for its own interval (see the sketch below)
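
A minimal sketch of an incremental extract, assuming a recent Airflow 2.x release and a hypothetical orders source table; the callable receives the run's data interval and pulls only that slice:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(data_interval_start, data_interval_end, **_):
    # Pull only the rows that belong to this run's interval instead of
    # re-reading the whole source table on every run.
    query = (
        "SELECT * FROM orders "
        f"WHERE created_at >= '{data_interval_start}' "
        f"AND created_at < '{data_interval_end}'"
    )
    print(query)  # placeholder for the real extract-and-load step


with DAG(
    dag_id="incremental_orders",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # one run per day, one day of data per run
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_orders", python_callable=extract_orders)
```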

Process Historic Data

  • A new workflow usually also needs the older data loaded, and ad-hoc workarounds are often used to backfill it
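
A less ad-hoc option is to let Airflow create the historic runs itself. A minimal sketch, assuming a recent Airflow 2.x release: with catchup=True the scheduler creates a run for every interval between start_date and now (the same can be done on demand with the `airflow dags backfill` command).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="orders_with_history",
    start_date=datetime(2023, 1, 1),  # how far back the history should go
    schedule="@daily",
    catchup=True,  # the scheduler creates every missing historic run
) as dag:
    EmptyOperator(task_id="placeholder_extract")
```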

Partition Ingested Data

  • Partitioning data at ingestion, for example by the run's logical date, lets parallel DAG runs write to separate locations without contending for a write lock (see the sketch below)
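
A minimal sketch of date-partitioned ingestion, assuming a hypothetical extract_events command and S3 bucket; each run writes under its own dt= prefix, so concurrent runs never touch the same files:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="partitioned_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    BashOperator(
        task_id="ingest_events",
        # {{ ds }} renders to the run's logical date, e.g. 2024-01-01
        bash_command=(
            "extract_events --date {{ ds }} "
            "--output s3://my-bucket/events/dt={{ ds }}/events.parquet"
        ),
    )
```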

Enforce the Idempotency Constraint

  • Rerunning a task with the same parameters should produce the same outcome, no matter which day the rerun happens (see the sketch below)
  • The exception is when the process itself changes, in which case the outcome can legitimately change
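
A minimal sketch of an idempotent delete-then-insert load, assuming the Postgres provider is installed and hypothetical fact_orders and staging_orders tables; rerunning it for the same date overwrites the partition instead of duplicating rows:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="idempotent_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    PostgresOperator(
        task_id="load_daily_orders",
        postgres_conn_id="warehouse",
        sql="""
            -- remove any rows a previous run already wrote for this date
            DELETE FROM fact_orders WHERE order_date = '{{ ds }}';
            INSERT INTO fact_orders
            SELECT * FROM staging_orders WHERE order_date = '{{ ds }}';
        """,
    )
```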

Enforce Deterministic Properties

  • For a given set of inputs, the output is always the same; common causes of non-determinism are:
      ◦ Using external state within the function
      ◦ Operating in time-sensitive ways
      ◦ Relying on the order of input variables
      ◦ Implementation issues inside the function (e.g. relying on dictionary ordering)
      ◦ Improper exception handling and post-exception behavior
      ◦ Intermediate commits and unexpected conditions
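
A minimal sketch of the difference, using a hypothetical orders table: the first function depends on the wall clock and is therefore non-deterministic, while the second depends only on the interval boundary Airflow passes in.

```python
from datetime import datetime


def build_query_non_deterministic():
    # Non-deterministic: the cutoff depends on the wall clock, so rerunning
    # the task on a later day selects a different slice of data.
    cutoff = datetime.utcnow()
    return f"SELECT * FROM orders WHERE created_at < '{cutoff}'"


def build_query_deterministic(data_interval_end):
    # Deterministic: the cutoff is the run's own interval boundary, so a
    # rerun always builds exactly the same query.
    return f"SELECT * FROM orders WHERE created_at < '{data_interval_end}'"
```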

Execute Conditionally

  • Tasks can be made to run only after other tasks succeed, or be skipped entirely based on a runtime condition (see the sketch below)
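
A minimal sketch of conditional execution with a branch task, using a hypothetical month-start condition; only the chosen downstream task runs and the other is skipped:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_path(ds, **_):
    # Run the heavy monthly rebuild only on the first day of the month.
    return "full_rebuild" if ds.endswith("-01") else "incremental_update"


with DAG(
    dag_id="conditional_execution",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    branch = BranchPythonOperator(task_id="choose_path", python_callable=choose_path)
    branch >> [
        EmptyOperator(task_id="full_rebuild"),
        EmptyOperator(task_id="incremental_update"),
    ]
```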

Code Workflow as well as the Applications

  • Keep both the workflow and the underlying application it drives under code control
  • One DAG can dynamically trigger and control other DAGs (see the sketch below)
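
A minimal sketch of one DAG driving another via TriggerDagRunOperator; the downstream DAG id and the conf payload are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="controller",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    TriggerDagRunOperator(
        task_id="trigger_downstream",
        trigger_dag_id="downstream_processing",  # the DAG being controlled
        conf={"source": "controller"},           # parameters handed to that DAG
    )
```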

Reset Data Between Tasks

  • What might look like inefficiency is intentional: tasks should not depend on reading each other's temporary files
  • Task instances of the same DAG can execute on different workers, so temporary data written by one task may not exist where the next task runs; persist intermediate results to shared storage instead (see the sketch below)
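
A minimal sketch of passing intermediate data through shared storage rather than local temporary files, assuming the Amazon provider is installed and a hypothetical bucket; with this pattern it does not matter which worker picks up which task:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def extract(ds, **_):
    data = "order_id,amount\n1,10.0\n"  # placeholder extract result
    # Upload to shared storage instead of leaving it in a local temp file.
    S3Hook().load_string(data, key=f"staging/{ds}/orders.csv",
                         bucket_name="my-bucket", replace=True)


def transform(ds, **_):
    # Read the intermediate data back from shared storage.
    raw = S3Hook().read_key(key=f"staging/{ds}/orders.csv",
                            bucket_name="my-bucket")
    print(len(raw))  # placeholder transform step


with DAG(
    dag_id="reset_between_tasks",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```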

Understand SLAs and Alerts

  • SLAs can be used to detect long-running tasks
  • Airflow sends an email notification when an SLA is missed, and can also invoke a callback (see the sketch below)
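
A minimal sketch of an SLA on a task plus a miss callback, assuming SMTP is configured and using a hypothetical notification address; if the task has not finished within an hour of the scheduled run, Airflow records the miss, calls the callback, and sends the email:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Hook for custom alerting (e.g. posting to a chat channel).
    print(f"SLA missed for tasks: {task_list}")


with DAG(
    dag_id="sla_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    sla_miss_callback=notify_sla_miss,
    default_args={"email": ["data-team@example.com"]},
) as dag:
    BashOperator(
        task_id="long_running_step",
        bash_command="sleep 10",
        sla=timedelta(hours=1),  # alert if not finished within an hour
    )
```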